{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Homework and bake-off: Word similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "__author__ = \"Christopher Potts\"\n",
    "__version__ = \"CS224u, Stanford, Fall 2020\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "1. [Overview](#Overview)\n",
    "1. [Set-up](#Set-up)\n",
    "1. [Dataset readers](#Dataset-readers)\n",
    "1. [Dataset comparisons](#Dataset-comparisons)\n",
    "  1. [Vocab overlap](#Vocab-overlap)\n",
    "  1. [Pair overlap and score correlations](#Pair-overlap-and-score-correlations)\n",
    "1. [Evaluation](#Evaluation)\n",
    "  1. [Dataset evaluation](#Dataset-evaluation)\n",
    "  1. [Dataset error analysis](#Dataset-error-analysis)\n",
    "  1. [Full evaluation](#Full-evaluation)\n",
    "1. [Homework questions](#Homework-questions)\n",
    "  1. [PPMI as a baseline [0.5 points]](#PPMI-as-a-baseline-[0.5-points])\n",
    "  1. [Gigaword with LSA at different dimensions [0.5 points]](#Gigaword-with-LSA-at-different-dimensions-[0.5-points])\n",
    "  1. [Gigaword with GloVe [0.5 points]](#Gigaword-with-GloVe-[0.5-points])\n",
    "  1. [Dice coefficient [0.5 points]](#Dice-coefficient-[0.5-points])\n",
    "  1. [t-test reweighting [2 points]](#t-test-reweighting-[2-points])\n",
    "  1. [Enriching a VSM with subword information [2 points]](#Enriching-a-VSM-with-subword-information-[2-points])\n",
    "  1. [Your original system [3 points]](#Your-original-system-[3-points])\n",
    "1. [Bake-off [1 point]](#Bake-off-[1-point])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "Word similarity datasets have long been used to evaluate distributed representations. This notebook provides basic code for conducting such analyses with a number of datasets:\n",
    "\n",
    "| Dataset | Pairs | Task-type | Current best Spearman $\\rho$ | Best $\\rho$ paper |   |\n",
    "|---------|-------|-----------|------------------------------|-------------------|---|\n",
    "| [WordSim-353](http://www.gabrilovich.com/resources/data/wordsim353/) | 353 | Relatedness | 82.8 | [Speer et al. 2017](https://arxiv.org/abs/1612.03975) |\n",
    "| [MTurk-771](http://www2.mta.ac.il/~gideon/mturk771.html) | 771 | Relatedness | 81.0 | [Speer et al. 2017](https://arxiv.org/abs/1612.03975) |\n",
    "| [The MEN Test Collection](https://staff.fnwi.uva.nl/e.bruni/MEN) | 3,000 | Relatedness | 86.6 | [Speer et al. 2017](https://arxiv.org/abs/1612.03975)  | \n",
    "| [SimVerb-3500-dev](https://www.aclweb.org/anthology/D16-1235/) | 500 | Similarity | 61.1 | [Mrki&scaron;&cacute; et al. 2016](https://arxiv.org/pdf/1603.00892.pdf) |\n",
    "| [SimVerb-3500-test](https://www.aclweb.org/anthology/D16-1235/) | 3,000 | Similarity | 62.4 | [Mrki&scaron;&cacute; et al. 2016](https://arxiv.org/pdf/1603.00892.pdf) |\n",
    "\n",
    "Each of the similarity datasets contains word pairs with an associated human-annotated similarity score. (We convert these to distances to align intuitively with our distance measure functions.) The evaluation code measures the distance between the word pairs in your chosen VSM (which should be a `pd.DataFrame`).\n",
    "\n",
    "The evaluation metric for each dataset is the [Spearman correlation coefficient $\\rho$](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) between the annotated scores and your distances, as is standard in the literature. We also macro-average these correlations across the datasets for an overall summary. (In using the macro-average, we are saying that we care about all the datasets equally, even though they vary in size.)\n",
    "\n",
    "This homework ([questions at the bottom of this notebook](#Homework-questions)) asks you to write code that uses the count matrices in `data/vsmdata` to create and evaluate some baseline models as well as an original model $M$ that you design. This accounts for 9 of the 10 points for this assignment.\n",
    "\n",
    "For the associated bake-off, we will distribute two new word similarity or relatedness datasets and associated reader code, and you will evaluate $M$ (no additional training or tuning allowed!) on those new datasets. Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set-up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "import csv\n",
    "import itertools\n",
    "import numpy as np\n",
    "import os\n",
    "import pandas as pd\n",
    "from scipy.stats import spearmanr\n",
    "import vsm\n",
    "from IPython.display import display"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "VSM_HOME = os.path.join('data', 'vsmdata')\n",
    "\n",
    "WORDSIM_HOME = os.path.join('data', 'wordsim')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset readers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def wordsim_dataset_reader(\n",
    "        src_filename,\n",
    "        header=False,\n",
    "        delimiter=',',\n",
    "        score_col_index=2):\n",
    "    \"\"\"\n",
    "    Basic reader that works for all similarity datasets. They are\n",
    "    all tabular-style releases where the first two columns give the\n",
    "    word and a later column (`score_col_index`) gives the score.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    src_filename : str\n",
    "        Full path to the source file.\n",
    "\n",
    "    header : bool\n",
    "        Whether `src_filename` has a header.\n",
    "\n",
    "    delimiter : str\n",
    "        Field delimiter in `src_filename`.\n",
    "\n",
    "    score_col_index : int\n",
    "        Column containing the similarity scores Default: 2\n",
    "\n",
    "    Yields\n",
    "    ------\n",
    "    (str, str, float)\n",
    "       (w1, w2, score) where `score` is the negative of the similarity\n",
    "       score in the file so that we are intuitively aligned with our\n",
    "       distance-based code. To align with our VSMs, all the words are\n",
    "       downcased.\n",
    "\n",
    "    \"\"\"\n",
    "    with open(src_filename) as f:\n",
    "        reader = csv.reader(f, delimiter=delimiter)\n",
    "        if header:\n",
    "            next(reader)\n",
    "        for row in reader:\n",
    "            w1 = row[0].strip().lower()\n",
    "            w2 = row[1].strip().lower()\n",
    "            score = row[score_col_index]\n",
    "            # Negative of scores to align intuitively with distance functions:\n",
    "            score = -float(score)\n",
    "            yield (w1, w2, score)\n",
    "\n",
    "def wordsim353_reader():\n",
    "    \"\"\"WordSim-353: http://www.gabrilovich.com/resources/data/wordsim353/\"\"\"\n",
    "    src_filename = os.path.join(\n",
    "        WORDSIM_HOME, 'wordsim353', 'combined.csv')\n",
    "    return wordsim_dataset_reader(\n",
    "        src_filename, header=True)\n",
    "\n",
    "def mturk771_reader():\n",
    "    \"\"\"MTURK-771: http://www2.mta.ac.il/~gideon/mturk771.html\"\"\"\n",
    "    src_filename = os.path.join(\n",
    "        WORDSIM_HOME, 'MTURK-771.csv')\n",
    "    return wordsim_dataset_reader(\n",
    "        src_filename, header=False)\n",
    "\n",
    "def simverb3500dev_reader():\n",
    "    \"\"\"SimVerb-3500: https://www.aclweb.org/anthology/D16-1235/\"\"\"\n",
    "    src_filename = os.path.join(\n",
    "        WORDSIM_HOME, 'SimVerb-3500', 'SimVerb-500-dev.txt')\n",
    "    return wordsim_dataset_reader(\n",
    "        src_filename, delimiter=\"\\t\", header=False, score_col_index=3)\n",
    "\n",
    "def simverb3500test_reader():\n",
    "    \"\"\"SimVerb-3500: https://www.aclweb.org/anthology/D16-1235/\"\"\"\n",
    "    src_filename = os.path.join(\n",
    "        WORDSIM_HOME, 'SimVerb-3500', 'SimVerb-3000-test.txt')\n",
    "    return wordsim_dataset_reader(\n",
    "        src_filename, delimiter=\"\\t\", header=False, score_col_index=3)\n",
    "\n",
    "def men_reader():\n",
    "    \"\"\"MEN: https://staff.fnwi.uva.nl/e.bruni/MEN\"\"\"\n",
    "    src_filename = os.path.join(\n",
    "        WORDSIM_HOME, 'MEN', 'MEN_dataset_natural_form_full')\n",
    "    return wordsim_dataset_reader(\n",
    "        src_filename, header=False, delimiter=' ')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This collection of readers will be useful for flexible evaluations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "READERS = (wordsim353_reader, mturk771_reader, simverb3500dev_reader,\n",
    "           simverb3500test_reader, men_reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset comparisons\n",
    "\n",
    "This section does some basic analysis of the datasets. The goal is to obtain a deeper understanding of what problem we're solving – what strengths and weaknesses the datasets have and how they relate to each other. For a full-fledged project, we would want to continue work like this and report on it in the paper, to provide context for the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_reader_name(reader):\n",
    "    \"\"\"\n",
    "    Return a cleaned-up name for the dataset iterator `reader`.\n",
    "    \"\"\"\n",
    "    return reader.__name__.replace(\"_reader\", \"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vocab overlap\n",
    "\n",
    "How many vocabulary items are shared across the datasets?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_reader_vocab(reader):\n",
    "    \"\"\"Return the set of words (str) in `reader`.\"\"\"\n",
    "    vocab = set()\n",
    "    for w1, w2, _ in reader():\n",
    "        vocab.add(w1)\n",
    "        vocab.add(w2)\n",
    "    return vocab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_reader_vocab_overlap(readers=READERS):\n",
    "    \"\"\"\n",
    "    Get data on the vocab-level relationships between pairs of\n",
    "    readers. Returns a a pd.DataFrame containing this information.\n",
    "    \"\"\"\n",
    "    data = []\n",
    "    for r1, r2 in itertools.product(readers, repeat=2):\n",
    "        v1 = get_reader_vocab(r1)\n",
    "        v2 = get_reader_vocab(r2)\n",
    "        d = {\n",
    "            'd1': get_reader_name(r1),\n",
    "            'd2': get_reader_name(r2),\n",
    "            'overlap': len(v1 & v2),\n",
    "            'union': len(v1 | v2),\n",
    "            'd1_size': len(v1),\n",
    "            'd2_size': len(v2)}\n",
    "        data.append(d)\n",
    "    return pd.DataFrame(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_overlap = get_reader_vocab_overlap()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def vocab_overlap_crosstab(vocab_overlap):\n",
    "    \"\"\"\n",
    "    Return an intuitively formatted `pd.DataFrame` giving vocab-overlap\n",
    "    counts for all the datasets represented in `vocab_overlap`, the\n",
    "    output of `get_reader_vocab_overlap`.\n",
    "    \"\"\"\n",
    "    xtab = pd.crosstab(\n",
    "        vocab_overlap['d1'],\n",
    "        vocab_overlap['d2'],\n",
    "        values=vocab_overlap['overlap'],\n",
    "        aggfunc=np.mean)\n",
    "    # Blank out the upper right to reduce visual clutter:\n",
    "    for i in range(0, xtab.shape[0]):\n",
    "        for j in range(i+1, xtab.shape[1]):\n",
    "            xtab.iloc[i, j] = ''\n",
    "    return xtab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_overlap_crosstab(vocab_overlap)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks reasonable. By design, the SimVerb dev and test sets have a lot of overlap. The other overlap numbers are pretty small, even adjusting for dataset size."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pair overlap and score correlations\n",
    "\n",
    "How many word pairs are shared across datasets and, for shared pairs, what is the correlation between their scores? That is, do the datasets agree?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_reader_pairs(reader):\n",
    "    \"\"\"\n",
    "    Return the set of alphabetically-sorted word (str) tuples\n",
    "    in `reader`\n",
    "    \"\"\"\n",
    "    return {tuple(sorted([w1, w2])): score for w1, w2, score in reader()}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_reader_pair_overlap(readers=READERS):\n",
    "    \"\"\"Return a `pd.DataFrame` giving the number of overlapping\n",
    "    word-pairs in pairs of readers, along with the Spearman\n",
    "    correlations.\n",
    "    \"\"\"\n",
    "    data = []\n",
    "    for r1, r2 in itertools.product(READERS, repeat=2):\n",
    "        if r1.__name__ != r2.__name__:\n",
    "            d1 = get_reader_pairs(r1)\n",
    "            d2 = get_reader_pairs(r2)\n",
    "            overlap = []\n",
    "            for p, s in d1.items():\n",
    "                if p in d2:\n",
    "                    overlap.append([s, d2[p]])\n",
    "            if overlap:\n",
    "                s1, s2 = zip(*overlap)\n",
    "                rho = spearmanr(s1, s2)[0]\n",
    "            else:\n",
    "                rho = None\n",
    "            # Canonical order for the pair:\n",
    "            n1, n2 = sorted([get_reader_name(r1), get_reader_name(r2)])\n",
    "            d = {\n",
    "                'd1': n1,\n",
    "                'd2': n2,\n",
    "                'pair_overlap': len(overlap),\n",
    "                'rho': rho}\n",
    "            data.append(d)\n",
    "    df = pd.DataFrame(data)\n",
    "    df = df.sort_values(['pair_overlap','d1','d2'], ascending=False)\n",
    "    # Return only every other row to avoid repeats:\n",
    "    return df[::2].reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    display(get_reader_pair_overlap())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks reasonable: none of the datasets have a lot of overlapping pairs, so we don't have to worry too much about places where they give conflicting scores."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Evaluation\n",
    "\n",
    "This section builds up the evaluation code that you'll use for the homework and bake-off. For illustrations, I'll read in a VSM created from `data/vsmdata/giga_window5-scaled.csv.gz`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "giga5 = pd.read_csv(\n",
    "    os.path.join(VSM_HOME, \"giga_window5-scaled.csv.gz\"), index_col=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def word_similarity_evaluation(reader, df, distfunc=vsm.cosine):\n",
    "    \"\"\"\n",
    "    Word-similarity evalution framework.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    reader : iterator\n",
    "        A reader for a word-similarity dataset. Just has to yield\n",
    "        tuples (word1, word2, score).\n",
    "\n",
    "    df : pd.DataFrame\n",
    "        The VSM being evaluated.\n",
    "\n",
    "    distfunc : function mapping vector pairs to floats.\n",
    "        The measure of distance between vectors. Can also be\n",
    "        `vsm.euclidean`, `vsm.matching`, `vsm.jaccard`, as well as\n",
    "        any other float-valued function on pairs of vectors.\n",
    "\n",
    "    Raises\n",
    "    ------\n",
    "    ValueError\n",
    "        If `df.index` is not a subset of the words in `reader`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    float, data\n",
    "        `float` is the Spearman rank correlation coefficient between\n",
    "        the dataset scores and the similarity values obtained from\n",
    "        `df` using  `distfunc`. This evaluation is sensitive only to\n",
    "        rankings, not to absolute values.  `data` is a `pd.DataFrame`\n",
    "        with columns['word1', 'word2', 'score', 'distance'].\n",
    "\n",
    "    \"\"\"\n",
    "    data = []\n",
    "    for w1, w2, score in reader():\n",
    "        d = {'word1': w1, 'word2': w2, 'score': score}\n",
    "        for w in [w1, w2]:\n",
    "            if w not in df.index:\n",
    "                raise ValueError(\n",
    "                    \"Word '{}' is in the similarity dataset {} but not in the \"\n",
    "                    \"DataFrame, making this evaluation ill-defined. Please \"\n",
    "                    \"switch to a DataFrame with an appropriate vocabulary.\".\n",
    "                    format(w, get_reader_name(reader)))\n",
    "        d['distance'] = distfunc(df.loc[w1], df.loc[w2])\n",
    "        data.append(d)\n",
    "    data = pd.DataFrame(data)\n",
    "    rho, pvalue = spearmanr(data['score'].values, data['distance'].values)\n",
    "    return rho, data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rho, eval_df = word_similarity_evaluation(men_reader, giga5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rho"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset error analysis\n",
    "\n",
    "For error analysis, we can look at the words with the largest delta between the gold score and the distance value in our VSM. We do these comparisons based on ranks, just as with our primary metric (Spearman $\\rho$), and we normalize both rankings so that they have a comparable number of levels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def word_similarity_error_analysis(eval_df):\n",
    "    eval_df['distance_rank'] = _normalized_ranking(eval_df['distance'])\n",
    "    eval_df['score_rank'] = _normalized_ranking(eval_df['score'])\n",
    "    eval_df['error'] =  abs(eval_df['distance_rank'] - eval_df['score_rank'])\n",
    "    return eval_df.sort_values('error')\n",
    "\n",
    "\n",
    "def _normalized_ranking(series):\n",
    "    ranks = series.rank(method='dense')\n",
    "    return ranks / ranks.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Best predictions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "word_similarity_error_analysis(eval_df).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Worst predictions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "word_similarity_error_analysis(eval_df).tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Full evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A full evaluation is just a loop over all the readers on which one want to evaluate, with a macro-average at the end:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def full_word_similarity_evaluation(df, readers=READERS, distfunc=vsm.cosine):\n",
    "    \"\"\"\n",
    "    Evaluate a VSM against all datasets in `readers`.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    df : pd.DataFrame\n",
    "\n",
    "    readers : tuple\n",
    "        The similarity dataset readers on which to evaluate.\n",
    "\n",
    "    distfunc : function mapping vector pairs to floats.\n",
    "        The measure of distance between vectors. Can also be\n",
    "        `vsm.euclidean`, `vsm.matching`, `vsm.jaccard`, as well as\n",
    "        any other float-valued function on pairs of vectors.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    pd.Series\n",
    "        Mapping dataset names to Spearman r values.\n",
    "\n",
    "    \"\"\"\n",
    "    scores = {}\n",
    "    for reader in readers:\n",
    "        score, data_df = word_similarity_evaluation(reader, df, distfunc=distfunc)\n",
    "        scores[get_reader_name(reader)] = score\n",
    "    series = pd.Series(scores, name='Spearman r')\n",
    "    series['Macro-average'] = series.mean()\n",
    "    return series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    display(full_word_similarity_evaluation(giga5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homework questions\n",
    "\n",
    "Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PPMI as a baseline [0.5 points]\n",
    "\n",
    "The insight behind PPMI is a recurring theme in word representation learning, so it is a natural baseline for our task. For this question, write a function called `run_giga_ppmi_baseline` that does the following:\n",
    "\n",
    "1. Reads the Gigaword count matrix with a window of 20 and a flat scaling function into a `pd.DataFrame`s, as is done in the VSM notebooks. The file is `data/vsmdata/giga_window20-flat.csv.gz`, and the VSM notebooks provide examples of the needed code.\n",
    "\n",
    "1. Reweights this count matrix with PPMI.\n",
    "\n",
    "1. Evaluates this reweighted matrix using `full_word_similarity_evaluation`. The return value of `run_giga_ppmi_baseline` should be the return value of this call to `full_word_similarity_evaluation`.\n",
    "\n",
    "The goal of this question is to help you get more familiar with the code in `vsm` and the function `full_word_similarity_evaluation`.\n",
    "\n",
    "The function `test_run_giga_ppmi_baseline` can be used to test that you've implemented this specification correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_giga_ppmi_baseline():\n",
    "    pass\n",
    "    ##### YOUR CODE HERE\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_run_giga_ppmi_baseline(func):\n",
    "    \"\"\"`func` should be `run_giga_ppmi_baseline\"\"\"\n",
    "    result = func()\n",
    "    ws_result = result.loc['wordsim353'].round(2)\n",
    "    ws_expected = 0.58\n",
    "    assert ws_result == ws_expected, \\\n",
    "        \"Expected wordsim353 value of {}; got {}\".format(\n",
    "            ws_expected, ws_result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    test_run_giga_ppmi_baseline(run_giga_ppmi_baseline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gigaword with LSA at different dimensions [0.5 points]\n",
    "\n",
    "We might expect PPMI and LSA to form a solid pipeline that combines the strengths of PPMI with those of dimensionality reduction. However, LSA has a hyper-parameter $k$ – the dimensionality of the final representations – that will impact performance. For this problem, write a wrapper function `run_ppmi_lsa_pipeline` that does the following:\n",
    "\n",
    "1. Takes as input a count `pd.DataFrame` and an LSA parameter `k`.\n",
    "1. Reweights the count matrix with PPMI.\n",
    "1. Applies LSA with dimensionality `k`.\n",
    "1. Evaluates this reweighted matrix using `full_word_similarity_evaluation`. The return value of `run_ppmi_lsa_pipeline` should be the return value of this call to `full_word_similarity_evaluation`.\n",
    "\n",
    "The goal of this question is to help you get a feel for how much LSA alone can contribute to this problem. \n",
    "\n",
    "The  function `test_run_ppmi_lsa_pipeline` will test your function on the count matrix in `data/vsmdata/giga_window20-flat.csv.gz`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_ppmi_lsa_pipeline(count_df, k):\n",
    "    pass\n",
    "    ##### YOUR CODE HERE\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_run_ppmi_lsa_pipeline(func):\n",
    "    \"\"\"`func` should be `run_ppmi_lsa_pipeline`\"\"\"\n",
    "    giga20 = pd.read_csv(\n",
    "        os.path.join(VSM_HOME, \"giga_window20-flat.csv.gz\"), index_col=0)\n",
    "    results = func(giga20, k=10)\n",
    "    men_expected = 0.57\n",
    "    men_result = results.loc['men'].round(2)\n",
    "    assert men_result == men_expected,\\\n",
    "        \"Expected men value of {}; got {}\".format(men_expected, men_result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    test_run_ppmi_lsa_pipeline(run_ppmi_lsa_pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gigaword with GloVe [0.5 points]\n",
    "\n",
    "Can GloVe improve over the PPMI-based baselines we explored above? To begin to address this question, let's run GloVe and see how performance on our task changes throughout the optimization process.\n",
    "\n",
    "__Your task__: write a function `run_glove_wordsim_evals` that does the following:\n",
    "\n",
    "1. Has a parameter `n_runs` with default value `5`.\n",
    "\n",
    "1. Reads in `data/vsmdata/giga_window5-scaled.csv.gz`.\n",
    "\n",
    "1. Creates a `TorchGloVe` instance with `warm_start=True`, `max_iter=50`, and all other parameters set to their defaults.\n",
    "\n",
    "1. `n_runs` times, calls `fit` on your model and, after each, runs `full_word_similarity_evaluation` with default keyword parameters, extract the 'Macro-average' score, and add that score to a list.\n",
    "\n",
    "1. Returns the list of scores created.\n",
    "\n",
    "The trend should give you a sense for whether it is worth running GloVe for more iterations.\n",
    "\n",
    "Some implementation notes:\n",
    "\n",
    "* `TorchGloVe` will accept and return `pd.DataFrame` instances, so you shouldn't need to do any type conversions.\n",
    "\n",
    "* Performance will vary a lot for this function, so there is some uncertainty in the testing, but `run_glove_wordsim_evals` will at least check that you wrote a function with the right general logic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_glove_wordsim_evals(n_runs=5):\n",
    "\n",
    "    from torch_glove import TorchGloVe\n",
    "\n",
    "    ##### YOUR CODE HERE\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_run_small_glove_evals(data):\n",
    "    \"\"\"`data` should be the return value of `run_glove_wordsim_evals`\"\"\"\n",
    "    assert isinstance(data, list), \\\n",
    "        \"`run_glove_wordsim_evals` should return a list\"\n",
    "    assert all(isinstance(x, float) for x in data), \\\n",
    "        (\"All the values in the list returned by `run_glove_wordsim_evals` \"\n",
    "         \"should be floats.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    glove_scores = run_glove_wordsim_evals()\n",
    "    print(glove_scores)\n",
    "    test_run_small_glove_evals(glove_scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dice coefficient [0.5 points]\n",
    "\n",
    "Implement the Dice coefficient for real-valued vectors, as\n",
    "\n",
    "$$\n",
    "\\textbf{dice}(u, v) = \n",
    "1 - \\frac{\n",
    "  2 \\sum_{i=1}^{n}\\min(u_{i}, v_{i})\n",
    "}{\n",
    "    \\sum_{i=1}^{n} u_{i} + v_{i}\n",
    "}$$\n",
    " \n",
    "You can use `test_dice_implementation` below to check that your implementation is correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dice(u, v):\n",
    "    pass\n",
    "    ##### YOUR CODE HERE\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_dice_implementation(func):\n",
    "    \"\"\"`func` should be an implementation of `dice` as defined above.\"\"\"\n",
    "    X = np.array([\n",
    "        [  4.,   4.,   2.,   0.],\n",
    "        [  4.,  61.,   8.,  18.],\n",
    "        [  2.,   8.,  10.,   0.],\n",
    "        [  0.,  18.,   0.,   5.]])\n",
    "    assert func(X[0], X[1]).round(5) == 0.80198\n",
    "    assert func(X[1], X[2]).round(5) == 0.67568"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    test_dice_implementation(dice)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### t-test reweighting [2 points]\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The t-test statistic can be thought of as a reweighting scheme. For a count matrix $X$, row index $i$, and column index $j$:\n",
    "\n",
    "$$\\textbf{ttest}(X, i, j) = \n",
    "\\frac{\n",
    "    P(X, i, j) - \\big(P(X, i, *)P(X, *, j)\\big)\n",
    "}{\n",
    "\\sqrt{(P(X, i, *)P(X, *, j))}\n",
    "}$$\n",
    "\n",
    "where $P(X, i, j)$ is $X_{ij}$ divided by the total values in $X$, $P(X, i, *)$ is the sum of the values in row $i$ of $X$ divided by the total values in $X$, and $P(X, *, j)$ is the sum of the values in column $j$ of $X$ divided by the total values in $X$.\n",
    "\n",
    "For this problem, implement this reweighting scheme. You can use `test_ttest_implementation` below to check that your implementation is correct. You do not need to use this for any evaluations, though we hope you will be curious enough to do so!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ttest(df):\n",
    "    pass\n",
    "    ##### YOUR CODE HERE\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_ttest_implementation(func):\n",
    "    \"\"\"`func` should be `ttest`\"\"\"\n",
    "    X = pd.DataFrame(np.array([\n",
    "        [  4.,   4.,   2.,   0.],\n",
    "        [  4.,  61.,   8.,  18.],\n",
    "        [  2.,   8.,  10.,   0.],\n",
    "        [  0.,  18.,   0.,   5.]]))\n",
    "    actual = np.array([\n",
    "        [ 0.33056, -0.07689,  0.04321, -0.10532],\n",
    "        [-0.07689,  0.03839, -0.10874,  0.07574],\n",
    "        [ 0.04321, -0.10874,  0.36111, -0.14894],\n",
    "        [-0.10532,  0.07574, -0.14894,  0.05767]])\n",
    "    predicted = func(X)\n",
    "    assert np.array_equal(predicted.round(5), actual)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    test_ttest_implementation(ttest)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enriching a VSM with subword information [2 points]\n",
    "\n",
    "It might be useful to combine character-level information with word-level information. To help you begin asssessing this idea, this question asks you to write a function that modifies an existing VSM so that the representation for each word $w$ is the element-wise sum of $w$'s original word-level representation with all the representations for the n-grams $w$ contains. \n",
    "\n",
    "The following starter code should help you structure this and clarify the requirements, and a simple test is included below as well.\n",
    "\n",
    "You don't need to write a lot of code; the motivation for this question is that the function you write could have practical value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def subword_enrichment(df, n=4):\n",
    "    pass\n",
    "    # 1. Use `vsm.ngram_vsm` to create a character-level\n",
    "    # VSM from `df`, using the above parameter `n` to\n",
    "    # set the size of the ngrams.\n",
    "\n",
    "    ##### YOUR CODE HERE\n",
    "\n",
    "\n",
    "    # 2. Use `vsm.character_level_rep` to get the representation\n",
    "    # for every word in `df` according to the character-level\n",
    "    # VSM you created above.\n",
    "\n",
    "    ##### YOUR CODE HERE\n",
    "\n",
    "\n",
    "    # 3. For each representation created at step 2, add in its\n",
    "    # original representation from `df`. (This should use\n",
    "    # element-wise addition; the dimensionality of the vectors\n",
    "    # will be unchanged.)\n",
    "\n",
    "    ##### YOUR CODE HERE\n",
    "\n",
    "\n",
    "    # 4. Return a `pd.DataFrame` with the same index and column\n",
    "    # values as `df`, but filled with the new representations\n",
    "    # created at step 3.\n",
    "\n",
    "    ##### YOUR CODE HERE\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_subword_enrichment(func):\n",
    "    \"\"\"`func` should be an implementation of subword_enrichment as\n",
    "    defined above.\n",
    "    \"\"\"\n",
    "    vocab = [\"ABCD\", \"BCDA\", \"CDAB\", \"DABC\"]\n",
    "    df = pd.DataFrame([\n",
    "        [1, 1, 2, 1],\n",
    "        [3, 4, 2, 4],\n",
    "        [0, 0, 1, 0],\n",
    "        [1, 0, 0, 0]], index=vocab)\n",
    "    expected = pd.DataFrame([\n",
    "        [14, 14, 18, 14],\n",
    "        [22, 26, 18, 26],\n",
    "        [10, 10, 14, 10],\n",
    "        [14, 10, 10, 10]], index=vocab)\n",
    "    new_df = func(df, n=2)\n",
    "    assert np.array_equal(expected.columns, new_df.columns), \\\n",
    "        \"Columns are not the same\"\n",
    "    assert np.array_equal(expected.index, new_df.index), \\\n",
    "        \"Indices are not the same\"\n",
    "    assert np.array_equal(expected.values, new_df.values), \\\n",
    "        \"Co-occurrence values aren't the same\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    test_subword_enrichment(subword_enrichment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Your original system [3 points]\n",
    "\n",
    "This question asks you to design your own model. You can of course include steps made above (ideally, the above questions informed your system design!), but your model should not be literally identical to any of the above models. Other ideas: retrofitting, autoencoders, GloVe, subword modeling, ... \n",
    "\n",
    "Requirements:\n",
    "\n",
    "1. Your code must operate on one or more of the count matrices in `data/vsmdata`. You can choose which subset of them; this is an important design feature of your system. __Other pretrained vectors cannot be introduced__.\n",
    "\n",
    "1. Retrofitting is permitted.\n",
    "\n",
    "1. Your code must be self-contained, so that we can work with your model directly in your homework submission notebook. If your model depends on external data or other resources, please submit a ZIP archive containing these resources along with your submission.\n",
    "\n",
    "In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies. We also ask that you report the best score your system got during development, just to help us understand how systems performed overall."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:\n",
    "#   1) Textual description of your system.\n",
    "#   2) The code for your original system.\n",
    "#   3) The score achieved by your system in place of MY_NUMBER.\n",
    "#        With no other changes to that line.\n",
    "#        You should report your score as a decimal value <=1.0\n",
    "# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS\n",
    "\n",
    "# NOTE: MODULES, CODE AND DATASETS REQUIRED FOR YOUR ORIGINAL SYSTEM \n",
    "# SHOULD BE ADDED BELOW THE 'IS_GRADESCOPE_ENV' CHECK CONDITION. DOING\n",
    "# SO ABOVE THE CHECK MAY CAUSE THE AUTOGRADER TO FAIL.\n",
    "\n",
    "# START COMMENT: Enter your system description in this cell.\n",
    "# My peak score was: MY_NUMBER\n",
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    pass\n",
    "\n",
    "# STOP COMMENT: Please do not remove this comment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bake-off [1 point]\n",
    "\n",
    "For the bake-off, we will release two additional datasets. The announcement will go out on the discussion forum. We will also release reader code for these datasets that you can paste into this notebook. You will evaluate your custom model $M$ (from the previous question) on these new datasets using `full_word_similarity_evaluation`. Rules:\n",
    "\n",
    "1. Only one evaluation is permitted.\n",
    "1. No additional system tuning is permitted once the bake-off has started.\n",
    "\n",
    "The cells below this one constitute your bake-off entry.\n",
    "\n",
    "People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.\n",
    "\n",
    "Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.\n",
    "\n",
    "The announcement will include the details on where to submit your entry."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Enter your bake-off assessment code into this cell.\n",
    "# Please do not remove this comment.\n",
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    pass\n",
    "    # Please enter your code in the scope of the above conditional.\n",
    "    ##### YOUR CODE HERE\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# On an otherwise blank line in this cell, please enter\n",
    "# your \"Macro-average\" value as reported by the code above.\n",
    "# Please enter only a number between 0 and 1 inclusive.\n",
    "# Please do not remove this comment.\n",
    "if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
    "    pass\n",
    "    # Please enter your score in the scope of the above conditional.\n",
    "    ##### YOUR CODE HERE\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
